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Abstract 

Background: Functionally relevant artificial or natural mutations are difficult to assess or predict if no structure- 
function information is available for a protein. This is especially important to correctly identify functionally 
significant non-synonymous single nucleotide polymorphisms (nsSNPs) or to design a site-directed mutagenesis 
strategy for a target protein. A new and powerful methodology is proposed to guide these two decision strategies, 
based only on conservation rules of physicochemical properties of amino acids extracted from a multiple 
alignment of a protein family where the target protein belongs, with no need of explicit structure-function 
relationships. 

Results: A statistical analysis is performed over each amino acid position in the multiple protein alignment, based 
on different amino acid physical or chemical characteristics, including hydrophobicity, side-chain volume, charge 
and protein conformational parameters. The variances of each of these properties at each position are combined 
to obtain a global statistical indicator of the conservation degree of each property. Different types of 
physicochemical conservation are defined to characterize relevant and irrelevant positions. The differences between 
statistical variances are taken together as the basis of hypothesis tests at each position to search for functionally 
significant mutable sites and to identify specific mutagenesis targets. The outcome is used to statistically predict 
physicochemical consensus sequences based on different properties and to calculate the amino acid propensities 
at each position in a given protein. Hence, amino acid positions are identified that are putatively responsible for 
function, specificity, stability or binding interactions in a family of proteins. Once these key functional positions are 
identified, position-specific statistical distributions are applied to divide the 20 common protein amino acids in 
each position of the protein's primary sequence into a group of functionally non-disruptive amino acids and a 
second group of functionally deleterious amino acids. 

Conclusions: With this approach, not only conserved amino acid positions in a protein family can be labeled as 
functionally relevant, but also non-conserved amino acid positions can be identified to have a physicochemically 
meaningful functional effect. These results become a discriminative tool in the selection and elaboration of rational 
mutagenesis strategies for the protein. They can also be used to predict if a given nsSNP, identified, for instance, in 
a genomic-scale analysis, can have a functional implication for a particular protein and which nsSNPs are most 
likely to be functionally silent for a protein. This analytical tool could be used to rapidly and automatically discard 
any irrelevant nsSNP and guide the research focus toward functionally significant mutations. Based on preliminary 
results and applications, this technique shows promising performance as a valuable bioinformatics tool to aid in 
the development of new protein variants and in the understanding of function-structure relationships in proteins. 
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Background 

Site-directed mutagenesis is a tool used in rational pro- 
tein design strategies to modify the structure or function 
of a protein to adapt it to particular performance 
requirements. Moreover, mutagenesis is a fundamental 
tool to study the relationship between protein structure 
and function, making possible the substitution of one 
amino acid by another, thus isolating the contribution 
of the original amino acid or the newly introduced 
amino acid to the structure and function of the protein 
as a whole [1,2]. 

However, site-directed mutagenesis-based rational 
protein design strategies present a widely recognized 
drawback. In order to introduce changes that could con- 
fer a desired function or characteristic to a protein, it is 
necessary to know, or at least to assume, something 
about the protein structure-function relationship. In 
other words, it is necessary to know, for each amino 
acid of the protein or at least for a select group of them, 
what is their particular contribution to the structure and 
function of the protein as a whole. 

Even for a small protein, and assuming that only a 
subset of amino acids contribute to features really deter- 
minant for its relevant function, the number of amino 
acids to be considered is very large. For a medium-sized 
protein and with no additional information regarding 
the possible structure-function relationship, an exhaus- 
tive search is practically impossible [3]. 

It is commonly known that there are certain amino 
acids in a protein that are necessary and fundamental 
for its activity, function or structure, and there are other 
amino acids that are readily replaceable by amino acids 
sharing a common characteristic, without affecting the 
main features of the protein. Therefore, certain amino 
acidic positions must conserve some unique properties, 
which are communicated to the entire molecule, for a 
given protein or protein family. When analyzing a pro- 
tein family, many functionally important residues of pro- 
teins can be identified because they have been conserved 
during evolution. However, residues that vary can also 
be critically important if their variation is responsible 
for diversity of protein function and improved pheno- 
types. This adds an entirely new complexity level to the 
analysis [4,5]. 

The same type of functional variation can be observed in 
nature, represented by single nucleotide polymorphisms 
(SNPs) in the coding region of a given gene. SNPs can be 
synonymous, often called silent mutations, or can substi- 
tute a particular amino acid for another in a protein pri- 
mary sequence, which is referred as to a non-synonymous 
SNP (nsSNP). Prediction of the occurrence of nsSNPs in a 
gene could be easily done by comparing nucleotide 
sequences and detecting or predicting nucleotide changes 



that occur with low-probability incidence. However, 
predicting which of these mutations will have an observa- 
ble effect on protein function is a much more difficult 
task. This is complicated by the fact that there is often no 
structure-function knowledge available about the protein. 
This variation in protein function can be subtle or lead to 
major phenotypic changes in living organisms. For 
instance, variations in the DNA sequences of genes can 
affect how humans develop diseases and respond to patho- 
gens, chemicals, drugs, vaccines, and other agents. 

When no structure-function information is available, it 
is relevant to determine these features only from the 
protein's amino acid sequence [5]. A number of bioinfor- 
matic algorithms have been devised previously to extract 
this type of information from the primary sequence of a 
protein [4-8] . Some of these methods have been success- 
fully used for the prediction of altered protein pheno- 
types caused by nsSNPs in protein genes [7-9]. Other 
methods have been described to derive this information 
from 3D structural data or molecular dynamics results 
[10]. A different approach that incorporates machine- 
learning techniques has been used to study the results of 
directed evolution experiments in order to explore pro- 
teins and to derive hidden structural rules [11]. 

For this large-scale analysis, we propose an alternative, 
taking into account the contribution of each amino acid 
to the general structure of the protein through their 
characteristic physicochemical properties, such as 
hydrophobicity, side-chain volume and charge. Each 
amino acid contributes with its own physicochemical 
characteristics to the entire protein, which adds to the 
characteristics of the other amino acids, thus determin- 
ing the relevant features of the protein, both global and 
site-specific [2]. 

In this study, we have developed a general algorithm, 
named Mutagenesis Objective Search and Selection 
Tool (MOSST), which analyzes the target protein as 
part of a multiple alignment to determine which are the 
positions that could be mutated with or without altering 
the common characteristics of the protein family, and 
gives mutagenesis estimations related to the possibility 
of whether a given amino acid change would have 
deleterious effects on the protein. A variant of the same 
method can be used to detect phenotypically relevant 
nsSNPs in a gene family and separate them from amino 
acid substitutions that do not have functional 
implications. 

Results and Discussion 

Algorithm 

As a summary of this work, the general algorithmic 
procedure is presented in Figure 1. The left end set of 
the global algorithm corresponds to the information 
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Sequence of the 
analyzed protein 



PROPROCESSING OF 
THE ANALYZED PROTEIN 



1 . Identify a representative group of similar proteins that 
have interesting common features 



MULTIPLE ALIGNMENT 
USING EXTERNAL ALGORITHM 



2. Perform a multiple 
alignment of proteins 



3. Remove redundancy from 
the multiple alignment 



MOSST ALGORITHM 



4. Calculate the average and variance for each 
property at every alignment position 



5. Calculate the significance of the variance for 
each property in every alignment position 



6. Calculate the differences of the variances 
between all the property pairs at each position 



7. Calculate the partial significance of 
each difference at each position 



8. Combine the partial significances in a 
single global significance for each position 



MOSST RESULTS 
FOR DESIGN OF 
MUTAGENESIS STRATEGIES 




Functionally non-disruptive 
and functionally 
deleterious mutations 



9. Estimate a critical value for the 
rejection region and the assignment 
of categories to global significances 



1 0. Determine the significant 
and very significant positions in 
the alignment 



1 1 . Calculate the sample probabilities 
of the presence of each amino acid 
at each position 



1 2. Identify positions with low 
global significance and low 
significance of the variances (IDPs) 



MOSST RESULTS 
FOR IDENTIFICATION 
OFnsSNPs 





Functionally non-disruptive 
► and functionally 

deleterious mutations 



Figure 1 Flow diagram of the whole MOSST algorithm. The algorithm can be alternatively used for the development of rational protein 
design strategies or for the identification of functionally significant nsSNPs. 



delivered to select and design site-directed mutagenesis 
strategies, while the right end set represents the key 
information used to identify functionally significant 
nsSNPs using the tools and statistical procedures pro- 
posed in this paper. Both analytic procedures share the 
same initial steps, but have subtle differences that are 



exploited to get the most appropriate results for each 
application. 

Background of the algorithm 

A general consensus in protein science states that some 
amino acid physicochemical properties are conserved at 
particular positions in a given protein family and some 
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others are not. This explains the possible variation that 
can be found in any protein family to achieve the same 
function and it also shows that, despite the possibility of 
variation, there are particular characteristics that must 
be retained in order for the protein to conserve its own 
distinctive features. 

Each macroscopic property of a protein is controlled 
by different amino acids in its sequence with a higher or 
lower degree of influence. Hence, both within the frame- 
work of site-directed mutagenesis experiments and in 
the detection and prediction of phenotypic effects of 
nsSNPs, it is interesting to identify amino acid residues 
that are relevant for a protein's function, specificity, sta- 
bility or binding interactions. 

The proposed algorithm has been designed to analyze 
the general conservation of each relevant physicochem- 
ical characteristic at each amino acid position in a pro- 
tein family (e.g. hydrophobicity, charge, volume and 
shape). In this context, a protein family is predefined by 
selecting relevant proteins according to arbitrary criteria 
(e.g. a certain function in the case of an enzyme family, 
or the conservation of a certain structure in the case of 
a conserved protein fold). A protein family can also be 
subdivided into subfamilies, in order to study the contri- 
bution of particular amino acids to differences between 
subfamilies. 

Preprocessing of the target sequence 

To compare the proteins belonging to a family, a multi- 
ple alignment of the amino acid sequences is carried 
out. The most used algorithm for these multiple align- 
ments is ClustalW, developed by Thompson et al. [12] 
In a multiple alignment, the algorithm tends to retain 
and conserve the amino acids in each position, using a 
weight matrix (for example, Blosum matrices [13] and 
the Gonnet matrix [14]). An optimal alignment is 
obtained using these general matrices, which guarantee 
that the properties as a whole are maximally conserved 
at each amino acid position of the alignment. 

The preprocessing of the target protein sequence is 
represented by steps 1 and 2 in Figure 1, The MOSST 
algorithm subsequently analyzes this multiple alignment 
to determine and quantify the statistical significance of 
the conservation of different physicochemical properties 
in particular positions of the protein family alignment. 
To perform this analysis, the conservation of each prop- 
erty is independently studied and patterns are identified 
by comparing the conservation of different properties at 
each amino acid position. Typical conservation cases are 
defined based on these conservation patterns, which are 
then used to statistically classify the amino acids as 
relevant or irrelevant with respect to the conservation 
of any physicochemical property in the protein and the 
predicted amino acid mutability in the protein 
family. Then the statistical parameters (probabilities) 



determined for the conservation of each amino acid 
position are used as predictors to classify possible muta- 
tions in the protein family as functionally impairing or 
functionally silent. Hence, these results can then be 
interpreted and sorted to design mutagenesis strategies 
or to identify nsSNPs. 

Control of the quality of the multiple alignment 

The proposed method allows identification of function- 
ally relevant positions in a particular protein when the 
protein is placed within a comparative reference group 
in a multiple alignment. The classification procedure 
gives unique results for each particular alignment, but 
the statistical result for each position will strongly 
depend on the quality of the alignment and nothing 
prevents a priori this significance level varying between 
one alignment and another. 

This variation must be controlled so that the compari- 
son gives coherent and reproducible results when con- 
fronted to minor changes in the comparison alignment. 
The proposed control of this variation, represented by 
step 3 in Figure 1, is done in two ways: 
Alignment quality control The quality of a multiple 
alignment specifically depends on the substitution 
matrices used and the negative weight contributions of 
gap creation and extension. In the test performed to 
check the proposed algorithm, the alignments obtained 
using default ClustalW parameters gave satisfactory 
results. 

Control of the representativity and the redundancy of 
the alignment In the proposed analysis algorithm, the 
group of proteins in the alignment is supposed to be 
representative of a wider protein group. If the protein 
group does not include representatives of some 
subgroup that has some special feature, then such fea- 
ture evidently cannot be included in the analysis. This 
sub-representation is impossible to detect by purely 
mathematical methods, but can be solved by judiciously 
choosing the proteins to be represented and analyzed, 
which is a task that has to be done by a suitable expert. 
On the other hand, if a subgroup of proteins is over- 
represented in the analyzed group, then the protein 
sample will be biased towards the characteristics of the 
most represented subgroup, ignoring the characteristics 
of the other subgroups. The over-representation of a 
subgroup in the sample is readily detectable, as it is 
possible to assess the percentage of similarity between 
all the proteins in a group. If a particular group shows a 
high similarity degree, i.e. high redundancy, then it is 
over-represented in the sample, and the number of 
proteins in that group could be reduced by eliminating 
the most similar proteins. 

Once a representative and non-redundant group of 
proteins is found, MOSST can be used to detect func- 
tionally relevant positions. 
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Statistical quantification of the conservation of properties 

Amino acids have physical and chemical properties that 
can be directly or indirectly measured, e.g. hydrophobi- 
city, polarity, charge, molecular shape, conformational 
propensity and others. A numerical value is typically 
assigned for the measured property to each amino acid, 
whether free or associated to others in a polypeptide 
chain. As side chains are different for each of the 20 
naturally occurring amino acids, these numerical values 
are different for each amino acid and provide a quanti- 
tative ordering between them with respect to the mea- 
sured property. Hence, these property values form a 
scale that represents a numerical scoring for each amino 
acid. Many of these amino acid properties have been 
compiled in multiple non-independent "amino acid 
scales" [15-50]. 

In this way, every amino acid is assigned a numerical 
score that quantifies a particular physicochemical prop- 
erty £1 (for nomenclature, see Table 1). Hence, any 
group of n amino acids will have n associated scores, 
X a> i..JC a>n . From these scores, an arithmetical average 
fi a and a sample variance estimator O a can be calcu- 
lated (Figure 1, step 4) as: 

1 " 

Ha = -Y l x W (1) 

i=i 

1 " 

ff , 2 = ^TE( X ^-^) 2 (2) 
i=i 

A multiple alignment of protein amino acid sequences, 
such as that obtained by using ClustalW, comprises a 
given number of N positions. Each position i contains a 
group of Mj optimally aligned amino acids, each from a 
different protein (Figure 2). As the multiple alignment 
at position i can include gaps, K; is not always equal to 
the number of proteins in the alignment. For example, 
in the ninth position (i = 9) of the alignment in Figure 
2, m 9 = 6. Hence, for any given amino acid property £1, 
an average fii,n,n and a variance <Ji,n,n can he calculated 
for each position i in a multiple alignment with n 
amino acids, using Eq.l and Eq.2, respectively. 

The variance O i>a>n , as its name suggests, quantifies 
the variability of the property £1 in position i. If Oi,n,« is 
small, a large variation in the property £1 is not allowed 
in position i, and vice versa. Therefore, (7/,n,M can be 
used to quantify the conservation of the property 12 at 
that position. However, CT l; a,M depends on the scale used 
to measure £1 and on the number n of amino acids 
compared. Hence, it cannot be used to compare the 
conservation of different properties or different quantifi- 
cation scales, or to compare the conservation between 
positions with different numbers of amino acids. 



A different scale-free and position-free standard para- 
meter has to be calculated to quantify and compare the 
conservation of properties in multiple alignments. 

In the general case, there are 20 different amino acids 
and therefore for a given number of n amino acids there 
are 20" different random amino acid combinations. For 
any given property, it is possible to calculate all the 20" 
possible combinations of n amino acids and their asso- 
ciated averages and variances. If amino acid combina- 
tions are random, any particular combination will have 
a probability of occurrence equal to 1/20". Therefore, 
both the average and variance associated to each combi- 
nation will have a probability of occurrence equal to 1/ 
20". Particularly, for a random combination of n amino 
acids and any given amino acid property £1, a discrete 
probability mass function (PMF nt „) for the variance can 
be constructed. The function PMF^^CJ = x) gives the 
probability that the variance (7 of the property £1 calcu- 
lated for a group of n amino acids is exactly equal to x 
[51]. 

From the PMF ain for the variance of the property £1, 
a cumulative distribution function (CDF ntH ) can be 
obtained [51]. As shown in Figure 3, if the variance G a 
= x, then CDF a>n (ffn = x) = p& is the probability of 
the variance of a random combination of n amino acids 
being between 0 and x. Thus, if Pi,n,n m a given posi- 
tion i in a multiple alignment is small, then the prob- 
ability of said variance having a value equal or less than 
* is very small in comparison with the probabilities of 
all the possible variances in a random combination of n 
amino acids, and vice versa. For example, a value of 
Pi,a,n = 0-05 means that the obtained variance <7 iiS i t „ (or 
a lower value) only occurs in a proportion of 1 to 20 (1/ 
20 = 0.05) in random combinations of n amino acids. 

The significance of variances for every position i in 
the target multiple protein alignment are calculated in 
step 5 of the algorithm (Figure 1). Notably, if Pi,a.,n m a 
given position i in a multiple alignment is small, the 
variance CT /ji2 , M of the amino acid property under analysis 
in that position is simultaneously small and relatively 
uncommon, i.e. the amino acid selection in that position 
is not random. Hence, a small Pi,n,n implies that the 
property value must be relatively invariable (i.e. con- 
served) at position i, and vice versa. The closer to 0 is 
the value of Pi f a,nt the more significant this variance is 
and the less random the amino acid group is at position 
i. 

The probability Pi t n,n is an indicator of the degree of 
conservation of any property £1 at each position i in a 
multiple alignment, contrary to the use of the variance 
alone. The advantages of this conservation measure are, 
firstly, its independence with regard to the number n of 
amino acids in the comparison (including the presence 
of gaps) and, secondly, the possibility of performing 
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Table 1 Nomenclature and abbreviations 



Variable name Description 

CDFx l S2n+1 Posterior cumulative distribution function for the values of the property fl at position /, given the previously known group of n 
amino acids present in the multiple alignment at such position 

CDF ( m in2 ),„ Cumulative distribution function of the absolute difference between the variances of any two properties f!1 and 02 for a 
random combination of n amino acids 

CDF n ,n Cumulative distribution function for a random combination of n amino acids and a given property fl 

H 0 Null hypothesis in the hypothesis test for differences 

H, Alternative hypothesis in the hypothesis test for differences 

f Position number in the multiple alignment 

IDP Invariable determinant position 

j or k Amino acid numbering sub-index 

L Total number of physicochemical properties considered in the MOSST analysis 

M Number of possible pairwise combinations (subsets having two elements) of properties 

"I 2 / 

n Number of amino acids in a multiple alignment position (without taking gaps into account) 

N Total number of positions in the multiple alignment 

N([j,(j) Normal distribution function with unknown parameters f/(mean) and a (variance) 

n, Number of amino acids in position / of the multiple alignment (without taking gaps into account) 

nsSNP Non-synonymous single nucleotide polymorphism 

P i,(01,02),n = 1 " P;,(01,D2)J1 

Level of significance of the null hypothesis H 0 for the difference of the variances of the properties f!1 and 02 at position / of a 
multiple alignment containing n amino acids 

P i,(Or,Os),n = 1 " P;,(nr,ns),n 

Level of significance of the null hypothesis H 0 for the difference of the variances of the properties fir and Os at position / of a 
multiple alignment containing n amino acids 

p' m Value of the m th level of significance, obtained from the set of M levels of significance of the differences between variances for 

all the possible pairwise combinations of properties, sorted in ascending order, for a given position / of a multiple alignment 

P(m,n2) Probability of getting a certain value or less for the absolute difference Aa^^in between the variances of any two properties 

HI and f!2 

PDFx jQnrl Posterior probability distribution function for the values of the property £1 at position /, given the previously known group of n 
amino acids present in the multiple alignment at such position 

P, Global probability of the null hypothesis being true considering all the properties together at position /, i.e. probability of all the 

variances being equal at position / of a multiple alignment 

P;,(m,n2),n Probability of getting a certain value or less for the absolute difference A<J{m,cn) between the variances of any two properties 
01 and f!2 at position / of a multiple alignment containing n amino acids 

Punr,cis),n Probability of getting a certain value or less for the absolute difference &a^ ar ,ns) between the variances of any two properties 
Or and fls at position / of a multiple alignment containing n amino acids 

Pi,n,n Cumulative probability of getting a certain value or less for the variance (cr,' i n,n) for a given property CI in position ; of a multiple 

alignment containing n amino acids 

PMFci, n Discrete probability mass function for a random combination of n amino acids and any given property CI 

p a Cumulative probability of getting a certain value or less for the variance for a given property £1 

Qu Global probability for the occurrence of the amino acid j considering all properties together at position i in the multiple 

alignment, i.e. probability of the amino acid j satisfying all the physicochemical requirements at position i 

<Jt/.Oji+i or Probability of the amino acid j (any of the 20 natural amino acids) to be present at position / (containing n amino acids in the 

tfj,i,£l r ,n+l multiple alignment analyzed) in a new protein not included in the multiple alignment, according to property CI or fl r 

r or s Property numbering sub-index 

VDP Variable determinant position 

VIP Variable irrelevant position 

x Unknown quantity or variable 

XiQj or X{ Q ri j Measure, value or score of the physicochemical property fl or Cl r for the amino acid j at position / of a multiple alignment 

Xaj Measure, value or score of the physicochemical property fl for the amino acid j 

2kT(m,n2) Absolute difference between the variances of any two properties fll and C12 

4<r W m,Q2),n Absolute difference between the variances of any two properties £11 and £12 at position / of a multiple alignment containing n 
amino acids 

^°v,(fir,ns),n Absolute difference between the variances of any two properties fir and fls at position / of a multiple alignment containing n 
amino acids 

fj Mean (of a normal distribution) 
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Table 1 Nomenclature and abbreviations (Continued) 



Vi,a,n Arithmetical average of the physicochemical property £1 in position / of the multiple alignment containing a number of n 

amino acids 

fj n Arithmetical average of the physicochemical property £1 

a Variance (of a normal distribution) 

0"/,n,n Sample variance estimator (standard deviation) of the physicochemical property £1 in position / of the multiple alignment 

containing a number of n amino acids 

Oh Sample variance estimator (standard deviation) of the physicochemical property £1 

T/,Q, n +i Test statistic for the (n+1) th amino acid at position / of a multiple alignment containing n amino acids 

O Generic physicochemical property 

£11, £12, £13 Physicochemical properties 1, 2 and 3 



fir, fls or fl r , fl s Physicochemical properties r and s 



scale-free and position-free comparisons between con- 
servations at different positions and using different 
scales or properties. Interestingly, the probability value 
Pi,si,n could also be used as a significance value in 
hypothesis tests for the obtained variance t7 ijfl)JI . 
Determination of mutationally relevant positions 
Ideally, each property Q. is unique and mutually inde- 
pendent, and its contribution to the global structure and 
function of a protein is relatively important. For an 
amino acid position i in a multiple alignment, and given 
a cumulative probability Pi,si,n for each amino acid 
property, we define three possible cases into which the 
relationships between the different probabilities Pi^n 
for each amino acid property at a position i could be 
classified a priori. In Figure 4, these three possible cases 
are shown with reference to three different amino acid 
scales. These three cases, which are defined in the fol- 
lowing paragraphs, can be used as a basis to determine 
good mutagenesis objectives. 



Position i in 
the alignment 
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Figure 2 Example multiple alignment of seven protein amino 
acid sequences. Each example protein has 16, 15, 15, 14, 15, 12 

and 13 amino acids, respectively. The multiple alignment has a 
length of 18 positions, which means that every sequence has at 
least 2 gaps. The calculation of the mean and variance of the 
property £1 is shown in detail for position 9 of the alignment (i = 9), 
with 6 amino acids and one gap (n = 6: 1 histidine, 3 threonines 
and 2 serines). 



Invariable determinant position (IDP) The first case is 
defined by obtaining a high significance level, i.e. a very 
low Pi^n value, for every amino acid property in a posi- 
tion i. This implies a relevant conservation of all the 
properties at that position. These positions are usually 
identified in multiple alignments as those positions that 
stringently accept only one or sometimes two different 
specific amino acids and are thus readily identifiable and 
visible using simple methods. The change of one amino 
acid for another in the analyzed position would probably 
imply a drastic change in the value of one or more 
properties and therefore alter the conserved necessity 
for invariability at that position, causing a partial or 
total loss of some of the characteristic properties of the 
protein family. If it is desired to conserve such proper- 
ties, e.g. conservation of catalytic function or interaction 




'9,n,6 



Figure 3 Typical plots of the cumulative distribution functions 
(CDF) of sample variances. In these plots, a a is a calculated 
sample variance for any combination of n amino acids and p a is 
the associated probability of obtaining such sample variance value 
for the property £1 just randomly choosing n amino acids, and CDF 
profiles vary depending on the number of amino acids selected. 
Continuing with the example of Figure 2, the sample variance cr 9A6 
has an associated probability P9,n,6 that can be found using the 
corresponding CDF nn 
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J 



k 



No property conservation Required conservation of One conserved property, 
requirements all the properties the others are irrelevant 



Position in 
the alignment 



VARIABLE 
IRRELEVANT 
POSITION 
(VIP) 



INVARIABLE 
DETERMINANT 
POSITION 
(IDP) 



VARIABLE 
DETERMINANT 
POSITION 
(VDP) 



Figure 4 Scheme depicting the three possible conservation cases described in the text For each different position (i, j, k) of a multiple 
alignment, the significance levels P9,n,6 corresponding to three amino acid properties of the example in Figure 2 (£1,, fl 2 , fl 3 ) are plotted, to 
determine the differential conservation of the properties. 



sites, as would be the general case in conservative site- 
directed mutagenesis studies, then these positions 
should not be considered as possible mutagenesis objec- 
tives. A mutation in this type of position is usually dele- 
terious for the protein function in nsSNPs. 
Variable irrelevant position (VIP) The second typical 
case is when all the probabilities p^a.n associated to var- 
iances for all the amino acid properties at position i are 
very high. Here, the significance of conservation of 
those properties at that position is very low, i.e. conser- 
vation is irrelevant. The replacement of one amino acid 
by another in any of these positions should not be 
determinant for the protein family, as the conservation 
of any particular characteristic is not required. In other 
words, the amino acids present at these positions do not 
contribute fundamentally to any relevant characteristic 
of the protein family. Therefore, it is expected that 
mutagenesis at a variable irrelevant position should not 
have a large effect on the characteristics of a protein 
family. A nsSNP with these characteristics is probably 
silent and does not affect protein function. 
Variable determinant position (VDP) The third case 
corresponds to the case where, at the analyzed position 
i, there is a high conservation, i.e. a high Pi^n signifi- 
cance, in one property and a very low significance or 
conservation in the remaining properties. Then, for this 
position there is a very high tendency to conserve one 
of the independent amino acid properties, which are 
necessarily privileged, and a simultaneous tendency to 
variability in the other groups of rather irrelevant 



properties. Thus, this position is variable in the sense of 
admittance of global variability of some characteristics, 
but nevertheless is determinant for possession of a con- 
served group of characteristics, which have to be present 
to communicate common characteristics to the protein 
family. This makes this kind of position a main target to 
be mutated. A nsSNP in this type of position can affect 
protein function in a rather unpredictable but probably 
very determinant way. 

A conservative mutagenesis strategy that aims to pre- 
serve the main functional or interactional properties 
that characterize the protein family and to selectively 
alter secondary traits should be focused on mutating 
only VDPs, while a non-conservative mutagenesis strat- 
egy could also include IDPs. If the aim is to identify 
nsSNPs, mutations that could alter protein function are 
most probably located at IDPs and VDPs. Whichever 
the task and the expected result would be, it is evidently 
useful to identify VDPs, given their allowed amino acid 
variability and the simultaneous relative importance of 
conservation of one property in such position in the 
entire protein family. However, VDPs are much more 
difficult to find than IDPs, which makes VDPs prime 
"hidden" functionally relevant positions that cannot be 
readily identified by other existing automatic methods. 
The following sections will describe the proposed 
method to identify these positions. 
Identification of Variable Determinant Positions 
We described previously that the determination of the 
significance values Pi t a. tn for each position i in a 
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multiple alignment is based on the CDFs of the var- 
iances of random combinations of n amino acids. After 
CDFs are calculated for the variance of each amino acid 
property, the significance Pi t a,„ associated to each prop- 
erty can be calculated at each position. 

With these Pi,n,n values we can identify in an align- 
ment the VDPs described previously as hidden function- 
ally determinant positions where one of the properties is 
conserved, while others are not. To identify such posi- 
tions (Figure 1, step 10), it is necessary to determine 
whether significant differences exist between the var- 
iances of the different amino acid properties at a given 
position i. In fact, differences between the variance of 
one property and the variances of every other could be 
used as indicators of the presence of conservation of a 
single property in position i. However, only significant 
(not random) differences will indicate that position / is 
indeed a VDP. Hence, in the following section, we 
develop a method that uses hypothesis testing to identify 
VDPs by quantifying the magnitude of the significance 
of such differences and assigning a probability to classify 
position i as a VDP. 
Hypothesis test for differences 

In the same way that probabilities pa can be calculated 
and cumulative distribution functions can be con- 
structed for the variance of any given property 12, a 
probability /?rm,ii2) for all the possible differences Ao 
(m.m) between the variances of any two properties 121 
and 122 can be found (Figure 5). The probability value p 
(m,ft2) is equivalent to the probability of the absolute 
difference between variances being AfJ^i >a2 ) or less. 
From these differences and their associated probabilities, 
a corresponding CDF (m>a2 ),« can be calculated for a 
group of n amino acids, considering all the possible ran- 
dom combinations of variances and their differences 
(Figure 6). 




Figure 5 Scheme depicting the calculation of differences Ao~,, 
(nr,ns),„ between pairs of sample variances for a VDP. Large 
paired differences exist maximally only when one property is strictly 
conserved while the others are not. This is exploited to combine 
evidence by integrating individual significances from different pairs 
of properties. 




Figure 6 Typical plots of the cumulative distribution functions 
(CDF) of the differences of sample variances. In these plots, flcr n 
is a calculated difference between sample variances for any 
combination of n amino acids and p n is the associated probability 
of obtaining such difference of sample variances for the property fl 
just randomly choosing n amino acids, and the CDF profiles vary 
depending on the number of amino acids selected. Continuing 
with the example of Figure 2, a calculated difference A0g i( Q 1>n 2) >6 of 
the sample variances between the properties and Cl 2 at position 
i = 9 and n = 6, has an associated probability P9,(m,n2),6 that can 
be found using the corresponding CDF {m ,ca),n Therefore, the 
significance level of the null hypothesis H 0 is p'/,(m,o2),n = 1 " Pi. 
(m,n2),n 

\ J 

Therefore, for every position i with n amino acids in a 
multiple alignment, a hypothesis test can be devised to 
test whether a given difference between the variances of 
two given properties 111 and 122 is not random, i.e. Aa it 
(m,n2),H is significant. The null and alternative hypoth- 
eses for this test are, respectively: 

H 0 : The variances of both amino acid properties are 
equal (i.e. their difference A<Ji^ni,n2),n is random). 

Hj. The variances of both amino acid properties are 
different (i.e. their difference 40|,(oi,m),* is not 
random). 

Thus, CDF(ai,n2),H is the distribution function of the 
test statistic 40j j (ai > m) > » under the null hypothesis, i.e. 
random differences between the variances. Hence, CDF 
(Aa i ,(si l , a2 ),n = x) = Pi,(ni,ti2),n is the probabil- 
ity of the difference between the variances of two prop- 
erties 121 and 122 in a random combination of n amino 
acids being less than x (see the example in Figure 6). 
Therefore, p' i >(a .\,a.i),n = 1 - /»«,(m,n2),» is the probability 
(or level of significance) of the null hypothesis for the 
properties 121 and 122 (Figure 6). For very low values, 
under certain arbitrary limits, the null hypothesis is not 
significant and could be rejected in favor of the alterna- 
tive hypothesis. 

The partial significance of the differences between 
each pair of properties is calculated in step 7 of the 
algorithm. However, as there are many different proper- 
ties 12, for each position i in a multiple alignment, each 
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pair of properties will have a corresponding significance 
value for the difference between their variances. None of 
these differences or their related significance levels 
could be seen mathematically as the significance of the 
position by itself. The significance of the position must 
integrate all the individual significances in a global sig- 
nificance value. If the null hypothesis is significantly 
rejected for one single property when compared with all 
the other properties, position i can most likely be identi- 
fied as a VDP. In particular, to identify a VDP the global 
significance must be maximal when all differences 
except one are significant. 
integration of the individual significances 
For a given position i in the alignment comprising a 
group of n amino acids, every possible pair of properties 
lir and lis will have a difference /l(7;,(nr,i2s),M between 
their respective variances, and a significance level p' it 
(iir,ns),« associated to this difference. Each comparison 
between any pair of properties gives information to 
prove or disprove that the variances of the properties 
are different. Therefore, every py^Qs)^ can be consid- 
ered as the significance level of a single experiment or 
sampling test to reject the null hypothesis at position i. 
However, the level of significance of the global null 
hypothesis must take into account the level of signifi- 
cance of all the differences between every pair of prop- 
erty variances. Therefore, none of the p'i,(nr,tis),n alone 
is able to represent the significance of the position i on 
its own. 

Statistically, the significance levels obtained from dif- 
ferent experiments to demonstrate the same hypothesis 
could be combined in order to get a composite level of 
significance, as if a single experiment would be done 
with the combined evidence of all those partial experi- 
ments. This combination could be done using different 
procedures according to the features of the particular 
cases (e.g. the Fisher procedure [52,53], procedures 
based on the Bonferroni inequality [54,55] and the 
improved Simes procedure [55,56]). In our case, the dif- 
ferences of variances between each possible pair of 
properties are not independent, since the difference 
between the last pair of properties is a linear function of 
the others, so the Fisher procedure cannot be used. 
From the other two methods, the Simes procedure gives 
the best results, as its rejection region contains the 
rejection region of the Bonferroni methods and it is 
always more powerful than those, specially in the case 
of highly dependent test statistics [56], which is sus- 
pected to be our case. 

We propose the use of the Simes method to combine 
the significance values of the differences of variances 
between all the properties, in order to obtain a global 
significance value for the differences at position i (Figure 
1, step 8). The Simes method can be applied in a simple 



way: if p' m = p' lt p'2<— P'm is the value of the level of 
significance of each difference of variances between all 
the pairwise combinations of properties, in ascending 
order, the value of the global significance that considers 
all the individual partial significances for position / is: 

Pi = min(-p' m ) (3) 

The global probability Pi represents the probability of 
the null hypothesis being true, given the values obtained 
for the differences of variances between different prop- 
erties. In our case, P t is the probability of all the var- 
iances being equal. The lower this probability, the 
higher the accuracy with which the null hypothesis 
could be rejected, i.e. it is more probable that the var- 
iances of the different properties at position i in the 
analyzed multiple alignment could be different. Hence, 
low Pi values could be used to directly identify VDPs in 
a multiple alignment (step 10, Figure 1). 
Determination of the rejection regions for the null 
hypothesis 

Statistical practice classifies the levels of significance of a 
hypothesis test in different categorical levels, usually 
"not significant", "significant" and sometimes "very sig- 
nificant". This categorization is performed using a criter- 
ium that imposes an arbitrary limit between one 
significance level and another. Usually, a value of 0.05 is 
the limit between "non significant" and "significant" 
levels. However, multiple experimental implementations 
of the proposed algorithm indicate that this limit is not 
stringent enough to correctly classify VDPs apart from 
IDPs and VIPs. Preliminary evidence shows that posi- 
tions in a multiple alignment could be roughly classified 
into one of the following categories according to their 
significance levels: not significant {Pi > 0.01), significant 
(0.0005 <Pi < 0.01) or very significant {Pi < 0.0005). 

With this classification, critical values for the rejection 
region of the null hypothesis are determined (Figure 1, 
step 9) and VDPs in the target protein are those posi- 
tions identified as significant or very significant (Figure 
1, step 10). Accordingly, the other positions are classi- 
fied as IDPs, which cannot be mutated without radically 
altering the characteristics of the analyzed protein 
family, and VIPs, which are not good mutagenesis tar- 
gets because they are "filling" amino acids or amino 
acids that contribute with indiscernible and different 
characteristics to each particular protein. 
Identification of Invariable Determinant Positions 
The identification of IDPs can be done much more easily 
than the identification of VDPs. Normally, IDPs can be 
directly identified by visual inspection and looking for 
complete or almost complete amino acid conservations 
at any given position. However, the identification of IDPs 
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can be readily implemented as a part of the MOSST algo- 
rithm, using the same tools previously developed to iden- 
tify VDPs. In fact, according to Figure 4, IDPs have two 
characteristic features: the variances of all the amino acid 
properties must have very low values (i.e. must be highly 
significant) and the differences between those variances 
must also be very small (i.e. differences are insignificant). 
Therefore, any position i for which the global probability 
Pi is classified as not significant and the probability 
for every il is classified as significant, must be an IDP 
(Figure 1, step 12). 

Determination of the possibilities of mutation for each 
position 

Once the amino acid positions in a protein whose 
mutation could be functionally significant have been 
identified, the possible effects of amino acid substitu- 
tions at such positions are determined in a further 
step of the MOSST algorithm. A distribution analysis 
of the different amino acid properties at said position 
is done in order to identify a priori possible function- 
ally relevant or functionally silent amino acid changes 
in the protein, either to guide a rational mutagenesis 
strategy or to identify non-evident nsSNPs (Figure 1, 
step 11). 

The proposed method determines the probability of 
an amino acid being present at a determined position 
using a statistical method based on the Student i-test. 
This procedure starts with the previously calculated 
average (fi ,,«,„) and variance estimator ((7/,n,„) at a 
position i of the multiple alignment. For each position, 
each amino acid property is assumed to have a normal 
distribution N(fi,a), with unknown parameters fi and 
O. This assumption is reasonable, because it is 
assumed that at each position the average of each 
scale would represent the ideal characteristic to be ful- 
filled at that position and the standard deviation 
would represent the accuracy level of conservation of 
the characteristic. 

For a sample of n values X i>a>1 ...X itat „ obtained for an 
amino acid property 11 at a given position i, assuming a 
normal distribution, it is possible to predict the distribu- 
tion of a new value X i>an+1 . Mathematically, if we have 
a normally-distributed variable X sample mean value 
fti,ti,n and a sample variance G i>a>n for a position i, then 
the test statistic: 



CDF Xiantl =P(X iAn+1 <*) 



tj,a,n+i = , for n ^ 2 



(4) 



has a Student's t-distribution with n-1 degrees of 
freedom [57]. As the probability distribution of t /)£1)M +i 
is known, it is possible to calculate the distribution of 
Xi,a,n+i as: 



= P ( ti,a,n+l a i,a,n\j 1 + - + IM,Sl,n < X 



(5) 



= P 



X — IJ-i,Q,n 



Gi.Ci.nJ 1 + 



1 

n / 



CDFxf aml is the posterior cumulative distribution 
function for the values of the property 12 at position i, 
given the previously known group of n amino acids pre- 
sent in the multiple alignment at such position. From 
this CDFx jS2n+1 , a probability density function PDFx, a „ +1 
can be derived. Since each amino acid naturally has one 
particular value of property il, PDFx iQ „ tl allows the 
assignment of a probability q^n.n+i to each amino acid 
/ (any of the 20 natural amino acids) to be present at 
position i in a new protein not included in the multiple 
alignment (see example in Figure 7). This probability is 
directly associated with the property il and its distribu- 
tion PDF XiAntl . 

As each amino acid can have many different proper- 
ties ill, il2-- &l'> and each property will have a differ- 
ent PDF X;,j2 r , n +h then each amino acid will have L 
associated individual probabilities to be present at posi- 
tion i, which we will call <7;,;,n r ,n+i for each amino acid 



(Student's t distribution) 




n.aa Mi.Q.n J Qaa,.i.Q,n+] 

Probability of the amino acid "aa" 
to be present at the position /. 

Figure 7 Scheme depicting the posterior probability density 
function (PDF) of the values of a property O. Continuing with 
the example of Figure 2, given the previously known group of n 
amino acids present in the multiple alignment at position i for a 
given amino acid aa having a value X aaa assigned for the property 
£2, and the average for fl at position i being f*9,n,6. then the 
shadowed area represents the probability C\aa\,i,£l,n+\ that the 
amino acid aa could be present at the position i, according to the 
property CI. 
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The combined global probability Qj :i for the occurrence 
of each amino acid /' at position i is calculated as: 



L 




This combined global probability represents the prob- 
ability of every particular amino acid satisfying the 
requirements of all properties at position i, and there- 
fore is an indicator of the functional suitability of each 
amino acid in such position. 

Quantification of the functional suitability of the amino 
acids 

The former multi-property consensus method allows 
calculation of the global probability of occurrence of 
each amino acid at each position in a multiple align- 
ment. The probability value can also be used as an indi- 
cator of the functional impairment introduced by each 
amino acid at every analyzed position i. If the require- 
ments of the functional properties that have to be ful- 
filled in position i lead to a high probability for an 
amino acid to be in that position, it means that the 
amino acid does not impair the function of the protein. 
Inversely, if the probability is low, such amino acid can 
be functionally deleterious. 

For this last step of the analysis, a method based on 
the sorting of these probabilities of occurrence in des- 
cending order and the construction of a scree plot with 
the ordered values can be used. An example of this pro- 
cedure is shown in Figure 8. The scree plot at each 
position i could help in getting the preference level or 




(sorted from highest to 
lowest Q probability) 



Figure 8 Scree plot of the probabilities Q jti of each amino acid 
being present at position /'. The probabilities are sorted from 
highest (left) to lowest (right). A fall contrast or scree criterium is 
applied to this plot to identify a cut point in the curve (dashed 
line), where the highest probability factors are chosen up to a point 
where the curve becomes approximately horizontal. 



reliance with which the analyzed position could be 
occupied by each of the amino acids, based on the 
required physicochemical properties at such position in 
the protein. An analysis of the scree plot and its curva- 
ture can serve to identify a cut point (as indicated in 
Figure 8) in the curve. This point can be determined by 
using a "fall contrast" or "scree" criterium [58], where 
the highest probability factors are chosen up to a point 
where the curve becomes approximately horizontal. A 
second criterium can also be used, where the highest 
probability factors are chosen to explain together at 
least a predefined high probability (usually 95%) or until 
the last factor has a non-significant probability (usually 
less than 5%). Any criterium will define two sections in 
the scree plot: on the left a curve section including an 
amino acid group with the highest probabilities and on 
the right a section with those having the lowest prob- 
abilities. This classification allows separation of a high- 
probability group comprising functionally non-disruptive 
amino acids from a low-probability group of functionally 
deleterious amino acids. Hence, Qjj can be used as a 
basis to identify non-evident nsSNPs and to design 
rational site-directed mutagenesis strategies for each 
position i in a protein. 

Implementation of the algorithm 

Three main issues were solved to implement the algo- 
rithm. These aspects will not be analyzed in detail in 
this paper, because they are not directly related to the 
algorithm theory and design, although they must be 
solved to get a good implementation thereof. 

The first aspect is the selection of suitable property 
vectors that could be successfully used in order to get 
the most accurate information from the MOSST algo- 
rithm and to exploit the classification abilities of the 
algorithm in an optimal way. Though the general algo- 
rithm has been presented as being able to use any group 
of property vectors to perform the proposed analysis, we 
selected a minimal number of three normalized vectors 
that are mathematically orthogonal to each other to 
optimize the statistical potency of the results. 

The second issue is the calculation of the probability 
distributions of the averages and variances of the 
selected optimal vectors, which are necessary for the 
implementation of the algorithm as set forth in the pre- 
sent paper. We have assessed that the Law of Large 
Numbers is not accurately applicable for combinations 
of less than 30 amino acids given the optimal vectors 
previously obtained. Even though the probabilities of the 
different amino acid combinations could be calculated 
in a direct way, it is computationally difficult and time- 
consuming for combinations of more than 5 or 6 amino 
acids. We developed a direct calculation method to get 
an approximate distribution for combinations from 2 to 
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30 amino acids, which analysis and explanation is out of 
the scope of this work. 

The last aspect is the determination of the right signif- 
icance levels that should be taken into account to clas- 
sify each position as "non significant", "significant" and 
"very significant" when determining the rejection regions 
for the null hypothesis of equality of variances. As we 
mentioned earlier, values of 0.01 and 0.005 could be 
preliminary used to select between these classifications, 
which are more stringent than those values usually used 
in statistical testing. However, our experience imple- 
menting the algorithm has shown that the most interest- 
ing positions in a protein alignment (which we called 
"primary positions") have to be determined case by case, 
based on characteristics of each different multiple 
alignment. In our experience, the implementation of the 
algorithm included this data analysis in order to get the 
most out of the results of the general algorithm. 

The MOSST algorithm has been implemented with the 
details mentioned above as a Graphical User Interface 
(GUI) in MATLAB (see Additional File 1, Additional File 
2 and Additional File 3). Given the nature of MOSST 
results, it is easier to display them in a graphical interface 
in order to allow the user to have an integral vision over 
them and to modify the parameters of the algorithm on- 
the-run to obtain the best possible results for each given 
analysis. One of the major advantages we devised from the 
use of an interactive GUI is the definition of "primary 
positions". We defined primary positions as those "very 
significant" positions that have exceptionally high mutabil- 
ity scores as defined by the statistic theory behind the 
algorithm. By using MOSST GUI, the user can select both 
manually or semi-automatically the threshold significance 
value above which a "very significant" position becomes a 
"primary position" and thus a primary mutagenesis 
objective. 

The semi-automatic method advantage over the man- 
ual method is that it helps the user to select a suitable 
threshold by presenting a differential scree plot where 
the user can interactively select the threshold by com- 
paring the differences (step sizes) between similar 
ordered significance values. The optimum threshold will 
be a "very significant" significance value having the lar- 
gest possible difference with immediately higher (adja- 
cent) significance values. If two or more significance 
values fulfill these criteria, the largest one is most appro- 
priately selected. This empirical rule is applied by the 
user with his/her own judgment to select a proper 
threshold value. 

The empirical selection of "primary positions" proved 
to be an additional tool to narrow the search for muta- 
ble positions in a protein family. An example of the 
relevance of determining primary positions is shown in 
the study of glycosyl hydrolases belonging to family 16. 



Testing: mutagenesis analysis of endoglucanases 
belonging to family 16 of the glycosyl hydrolases 
Selection of representative proteins 

MOSST was used for the analysis of representative 
endoglucanases classified as glycosyl hydrolases belong- 
ing to family 16. They have several appropriate charac- 
teristics: (1) a very close structural and folding similarity 
despite them not having a very high degree of sequence 
similarity, (2) well characterized residues that participate 
in the interaction with the substrates at the active site 
and define the substrate specificity of the family, (3) well 
characterized active residues that are indispensable for 
catalysis, (4) two well defined and populated enzyme 
families with distinct substrate specificities and no 
known structure-function correlations, namely liche- 
nases (endo-l,3-l,4-P-D-glucanases, EC 3.2.1.73) and 
laminarinases (endo-l,3-p-D-glucanases, EC 3.2.1.39, 
endo-l,3(4)-P-D-glucanases, EC 3.2.1.6), and (5) large 
availability of experimental data, including mutagenic 
studies. These characteristics were considered ideal to 
test the ability of MOSST to determine functionally 
relevant amino acid positions against experimentally 
tested results, since this allows the analysis of amino 
acid mutations purely by the effect of their side groups, 
while leaving other factors (molecular structure, catalytic 
properties, loop variations) outside of the analysis. 

Among these enzymes, we selected Cellulosimicro- 
bium cellulans's Bglll as our reference enzyme. Proteins 
included in the analysis were selected by their similarity 
with Bglll and both the conservation of folding pattern 
and enzymatic activity. Table 2 shows the representative 
set of proteins selected and included in this study. 
Obtention of a multiple alignment 

Following the flow diagram of the MOSST algorithm, 
these proteins were aligned using ClustalW (http://www. 
ebi.ac.uk/clustalw/, [12]). Preliminary tests using differ- 
ent substitution matrices and different gap opening and 
extension penalties were performed. The combination 
that exhibited best results was obtained using an identity 
matrix as substitution matrix and minimal values for 
gap creation and extension. This combination empiri- 
cally gives more importance to conserved sectors, which 
in our case correspond to secondary structure and fold- 
ing sectors that are conserved in the family. This strat- 
egy is concordant with the fact that the studied group of 
proteins is a very structurally conserved enzyme family, 
even when conservation is lower at the sequence level 
due to the difference in activities of the proteins. 
Redundancy analysis 

For this objective, a protein clustering dendrogram and 
an agglomeration plot were constructed according to 
the similarity percentage between the proteins in the 
multiple alignment (Figure 9). An optimal separation 
between sub-families is obtained using a similarity 
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Table 2 Analyzed endoglucanases belonging to glycosyl hydrolases family 16 



# Swiss Prot EMBL or GenBank 
code code 



Description 



Organism 



Enzymatic 
classification 



9 
10 

1 1 

12 

13 
14 

15 

16 

17 

18 

19 

20 

21 

22 

23 

24 

25 
26 



068641 
Q51333 
Q60039 

Q9WXN1 
073951 
052754 
P45798 

Q45095 
P23903 

Q9Z3Q2 

033680 

P07980 

Q45691 
P04957 



P27051 



P45797 

P29716 
P29716 
014412 
P37073 
Q59328 
Q26660 



AF052745 0-1 ,3-glucanase II (Bglll) 

U56935 p-1 ,3-glucanase Ha (Bgllla) 

Z47974 Laminarinase 

AE001690 Laminarinase 

AF013169 endo-p-1,3-glucanase (precursor) 

AF047003 Laminarinase 

U04836 p-glucanase (precursor) 

JN0772 p-1,3-glucanase bglH (precursor) 

M34503 Glucan endo-l,3-P-glucosidase A1 (precursor) 

AJ225896 endo-1,3-1,4-p-glucanase eglC 

U89164 endo-1,3-1,4-p-glucanase exsH 

M15674 p-glucanase (precursor) 

U60830 endo-p-1,3-1,4-glucanase 

X00754 p-glucanase (precursor) 

CAA81096 (Z25877) hybrid endo-1,3-1,4-p-glucanase (synthetic 

construct) 



X57279 



P-glucanase (precursor) 



CAA81 092 (Z25873) hybrid endo-1, 3-1, 4-p-glucanase (synthetic S. 

construct) 

CAA81094 (Z25875) hybrid endo-1, 3-1, 4-p-glucanase (synthetic S. 

construct) 



X57094 

X58392 
X63355 (JS0611) 

U63813 
M84339 (A48378) 

X89732 

U4971 1 
U42580 (AAC96462) 



P-glucanase (precursor) 

P-glucanase (precursor) 

P-glucanase (precursor) 

p-glucanase (precursor) 

P-glucanase (precursor) 

Endo-1, 3(4)-p-glucanase 

p-1,3-glucanase 
p-1,3-glucanase 



Cellulosimicrobium cellulans 
C. cellulans 
Thermotoga neapoiitana 

Thermotoga maritima 

Pyrococcus furiosus 
Rhodothermus marinus 
ft. marinus 

Bacillus circulans 
B. circulans 

Rhizobium meliloti 

ft. meliloti 

Bacillus amyloliquefaciens 

Bacillus subtilis 
B subtilis 

Bacillus macerans I B. 
amyloliquefaciens 

Bacillus licheniformis 
macerans I B. amyloliquefaciens 
macerans I B. amyloliquefaciens 
Paenibacillus polymyxa 
Clostridium thermocellum 
C. thermocellum 
Orpinomyces sp. PC-2 
Bacillus brevis 
C. thermocellum 

Strongylocentrotus purpuratus 

Paramecium bursaria 
chlorella virus (PBCV-1) 



Laminarinase 

Laminarinase 

Laminarinase 
(EC 3.2.1.39) 

Laminarinase 

Putative laminarinase 

Laminarinase 

Lichenase 
(EC 3.2.1.73) 

Putative lichenase 

Laminarinase 
(EC 3.2.1.39) 

Putative lichenase 
(EC 3.2.1.-) 

Putative lichenase 
(EC 3.2.1.-) 

Lichenase 
(EC 3.2.1.73) 

Putative lichenase 

Lichenase 
(EC 3.2.1.73) 

Putative lichenase 

Lichenase 
(EC 3.2.1.73) 

Putative lichenase 

Putative lichenase 

Lichenase 
(EC 3.2.1.73) 

Lichenase 
(EC 3.2.1.73) 

Lichenase 
(EC 3.2.1.73) 

Lichenase 
(EC 3.2.1.73) 

Lichenase 
(EC 3.2.1.73) 

Laminarinase 
(EC 3.2.1.6) 

Putative laminarinase 

Putative laminarinase 



Descriptor codes in the multiple alignment are highlighted in bold characters. 



percentage cutoff of 86%. Three groups of similar 
enzymes were represented by only one of them: (i) syn- 
thetic hybrid lichenases 17 and 18 from Bacillus macer- 
ans I Bacillus amyloliquefaciens (enzyme numbers are 
given with reference to the order numbering in Table 
2), (ii) Bacillus amyloliquefaciens (12) and Bacillus subti- 
lis (13 and 14) lichenases, and (iii) two enzymes from 
Rhodothermus marinus, one of them classified as a lami- 
narinase (6) and the other classified as a lichenase (7). 



The remaining proteins, even those belonging to the 
same species, are considered different enough to be 
included in the analysis with no added redundancy. 

It can be observed in the dendrogram a subdivision of 
the proteins in the alignment in two large sub-families 
corresponding mainly to the division between lichenases 
(left-hand sub-family in the dendrogram) and laminari- 
nases (right-hand sub-family). This observation indicates 
that the information contained in the multiple 
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Figure 9 Auxiliary plots for redundancy removal in the analyzed protein family The top dendrogram shows the clustering of proteins 
according to the distance (similarity percentage) between them. The bottom plot is an agglomeration distance plot of the top dendrogram. In 
both plots a horizontal line representing a similarity percentage of 86% that was taken as the limit over which two proteins were considered as 
identical. This value is the minimal value within the most pronounced step in the agglomeration distance plot. The numbers of the different 
proteins are the order numbers assigned in Table 2. 



alignment is also able to discriminate between func- 
tional aspects of the enzymes. 
Calculation of the statistic values and significances 

For each of the positions of the non-redundant multiple 
alignment, the associated variance in each of the three 
principal components and the differences between the 
three components were calculated. The values of said 
differences were used to calculate the global significance 
of each position and to classify said significance accord- 
ing to the criteria exposed in the previous paper of this 
series. This allowed the classification of positions in the 
alignment as "non-significant", "significant" and "very 
significant". In addition, a new category of "primary 
position" was defined when the global significance of 
the position is included in the higher percentile of the 
values in the distribution of global variances as 



described in the practical implementation of MOSST 
analysis. 

Figure 10 shows the graphical results obtained for var- 
iance probabilities for each component and each posi- 
tion, and the results obtained for global significances at 
each position. In these plots, many positions can be 
identified as "primary" mutagenesis targets as defined 
before and also many "very significant" positions, all of 
them scattered along the catalytic domain of endogluca- 
nases belonging to glycosyl hydrolases family 16. How- 
ever, the distribution of primary mutagenesis targets is 
not uniform, but said targets are concentrated in more 
or less defined sectors in the amino acid sequence of 
the protein, which gives a hint about the existence of 
determinant and non-determinant regions for this 
enzyme family. 
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Figure 10 Result plots for the global significance of the positions (top) and the significances of variances in each component 
(bottom). NLSDV: negative of the base-10 logarithm of the significance of the difference of variances; NLSV: negative of the base-10 logarithm 
of the significance of variances. 



Structural mapping of primary positions 

The determined primary positions can be mapped over 
a model of the tridimensional structure of Bglll. In this 
structural mapping a spatial distribution pattern of sec- 
tors in which variable determinant positions are concen- 
trated can be observed, i.e. the most promising 
mutagenesis targets for this protein family (Figure 11). 
The active site of the protein family is comprised widely 
by amino acids in positions 45-49 (Asn26-Gln30), 167- 
179 (Ile89-Ser99), 187-196 (Serl07-Asnll6), 205-210 
(Hisl25-Glyl30), 214-222 (Glyl32-Ilel39), 307-311 
(Phel91-Phel95) and 315-319 (Leul99-Val203) in the 
multiple alignment. A detailed analysis of active site 
amino acids according to the algorithm implemented in 
this work is shown in Figure 12. Inside the active site, 
variable determinant positions were identified as the fol- 
lowing amino acids using Bglll numbering (and their 
position in the multiple alignment indicated between 
brackets): Leu29 (48), Trp90 (168), Phe93 (171), Met95 
(173), Leu96 (174), Glyl09 (189), Metll4 (194), Glyl26 
(206), Vall28 (208), Hisl29 (209), Glyl30 (210), Phel91 
(307), Phel95 (311), Leul99 (315) and Ala202 (318). 



Predictive value of the algorithm: comparative analysis of 
laminarinases 

The MOSST algorithm can be applied only to an 
enzyme sub-family. In our case, these sub-families can 
be determined from the redundancy removal dendro- 
gram and are associated with enzyme function differ- 
ences, as mentioned earlier. It is interesting to apply this 
analysis procedure to laminarinases and compare these 
results with the variable determinant positions obtained 
for the entire family. Therefore, variable determinant 
positions were determined for laminarinases using a 
similar procedure to that used for the total protein 
family, and these positions were compared with those 
obtained for all family 16 glucanases. The active site 
structure of Bglll is represented in Figure 13 with differ- 
entiated variable determinant amino acids for family 16 
glucanases and variable determinant positions for the 
laminarinase sub-family. Logically, if a position in the 
active site is variable for laminarinases as a family then 
said position cannot be significant to determine sub- 
strate specificity of these enzymes. Instead, if a position 
is variable determinant for family 16 glucanases but not 
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Figure 11 3D mapping of the amino acids onto the 3D structure of Bglll. Mutagenically interesting positions (light grey) are mapped over 
the molecular structure of the catalytic domain of Bglll, selected as a representative structure of family 16 glycosyl hydrolases (order number 1 
in Table 2). This figure is a cross-eyed stereogram. 



for laminarinases, then it is logical to assume that said 
position could be relevant in the determination of lami- 
narinase specificity when compared with the other 
enzymes of the same family. 

Under this assumption, the comparative analysis using 
the MOSST algorithm shows that the following active 
site amino acids (having either exposed or buried side 
chains) can have a determinant function over laminari- 
nase specificity: Leu29, Trp90, Glyl09, Glyl30, Phel91, 
Phel95 and Ala202. It is interesting to asses that some 
of these amino acids have been identified in the analysis 
of the molecular model of Bglll and other family 16 glu- 
canases as amino acids that participate in substrate 
binding and/or interactions [59,60]. This observation 
corroborates the ability of the MOSST algorithm to 
identify relevant positions in a protein. 
Predictive value of the algorithm: analysis of histidine 129 
Position 209 (Figure 12), which corresponds to histidine 
129 in Bglll, is very remarkable. This position has been 
identified in the structural analysis of Bglll as an amino 
acid that could be implied in the chemical reactions cat- 
alyzed by this enzyme [61]. 

The MOSST algorithm marks position 209, i.e. histi- 
dine 129, in Bglll as a primary mutagenesis objective 
and therefore this fact can be exploited to obtain inter- 
esting variants of the enzyme, possibly having different 
catalytic properties due to the position occupied by said 
amino acid in the active site. It should be noted that 
this histidine is classified as a primary amino acid both 
for the global family of proteins under study and for 
laminarinases, and so a possible substrate interaction 
function should be discarded, at least in a first approach. 



MOSST algorithm suggests for said position the pre- 
sence of a histidine residue (with a 93.5% of probability) 
and the alternative presence of a phenylalanine residue 
(2.4% probability) or a glutamine residue (1.5% probabil- 
ity). The remaining amino acid have probabilities of less 
than 1% each and their combined probabilities only 
adds to 2.6%, so their presence is not considered rele- 
vant in said position. The real nature and contribution 
of each proposed alternative could be analyzed using 
other predictive techniques, in advance to mutagenic 
experiments, in order to decide which of said variants 
would be the most favorable one to obtain the desired 
effects in the protein. It is also possible to analyze all 
the proposed alternatives, producing in the laboratory 
the corresponding mutant enzymes and analyzing their 
new properties. 

The contribution of this amino acid and the possible 
mutations performed on it to the enzymatic catalysis 
could be assessed by means of an additional analysis, for 
example, a kinetic analysis of a catalytic model. In a for- 
mer work [61], using structural and mechanistic knowl- 
edge about glycosyl hydrolases from families 7, 10, and 
16, we have formulated a mathematical model that can 
include ionizable residues in the active site and incor- 
porating electrostatic influences via acid dissociation 
equilibrium constants and chemical relationships such 
as hydrogen bonds. The results of the simulations indi- 
cated a clear shift in the pH dependence of activity for 
the enzymes only when a close interrelation (hydrogen 
bond) between a catalytic glutamate and histidine 129 is 
taken into account, which is concordant with experi- 
mental evidence with Bglll (manuscript in preparation). 
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Figure 12 Result plots for the amino acids that form the active site of Bglll Global significance of positions and significances of variances 
for each component, for positions in the multiple alignment corresponding to amino acids that form the active site of Bglll (and other family 16 
glycosyl hydrolases): (a) positions 20-60; (b) positions 160-200; (c) positions 200-240; and (d) positions 290-330. 



Moreover, we demonstrated that the presence of the 
kind of chemical interaction proposed could provide sta- 
bilization of the activity in the presence of environmen- 
tal, structural, pH and electrostatic variations. The 
results suggested a new way to modify, via site-directed 
mutagenesis, the acid dissociation of one of the catalytic 
residues in the active site independently of the other, 
which could have clear advantages over the purely elec- 
trostatic modifications that usually affect both residues 
simultaneously. 

Under the light of this previous work, MOSST analysis 
results were very striking in their ability to identify this 
residue as relevant even without any previous knowledge 
or suspicion about its relevance or function in this 
enzyme family. In our opinion, this clearly demonstrates 
the usefulness of MOSST as a general tool for the 
design of new site-directed mutagenesis strategies in 
protein families. 



Conclusions 

A statistical procedure has been designed and presented 
to semi-automatically identify functionally significant 
mutable positions in a protein, based on the conserva- 
tion of physicochemical properties. Such positions are 
identified and classified into three groups, according to 
the influence their mutation could have on protein func- 
tion. Those in which a mutation does not alter the func- 
tion and basic characteristics of the protein, but do 
change them slightly, and those in which a mutation is 
totally deleterious for the protein are the most relevant 
positions to look for nsSNPs, while only the former are 
important when trying to develop site-directed muta- 
genesis strategies so that variants with improved proper- 
ties could be generated. 

Amino acid positions are also classified in three 
groups: variable irrelevant positions (VIPs), invariable 
determinant positions (IDPs) and variable determinant 
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Figure 13 Comparative mapping of primary variable determinant positions Primary variable determinant positions for all the studied 
protein family (lichenases and laminarinases) are shown in white and black. White positions are variable determinant amino acids both for 
lichenases and laminarinases, while black positions are primary variable determinant positions specific for laminarinases. This figure is a cross- 
eyed stereogram. 



positions (VDPs). The recognition and identification of 
VDPs is especially relevant, since these are "hidden" 
functionally relevant positions that cannot be easily 
identified by other existing automatic methods. This 
classification serves as a basis to rationally identify rele- 
vant amino acid positions in a protein in the frame of a 
rational design strategy or in the identification of 
nsSNPs. In this way, a conservative mutagenesis strategy 
that aims to preserve the main properties of a protein 
family should be focused on mutating only VDPs, while 
a non-conservative mutagenesis strategy could also 
include IDPs. If the aim is to identify nsSNPs, mutations 
that could alter protein function are most probably 
located at IDPs and VDPs. 

Once these key functional positions are identified, the 
statistical distribution of the relevant physicochemical 
properties at each protein position is analyzed to get a 
list of the 20 common protein amino acids ordered 
according to the global probability with which they can 
conform to the required property profile of each rele- 
vant position. This ordered list is divided into a group 
of functionally non-disruptive amino acids and a second 
group of functionally deleterious amino acids. 

These results become a discriminative tool in the 
selection and elaboration of rational mutagenesis strate- 
gies for the protein. They can also be used to predict if 
a given nsSNP, identified, for instance, in a genomic- 
scale analysis, can have a functional implication for a 
particular protein and which nsSNPs are most likely to 
be functionally silent for a protein. This analytical tool 
could be used to rapidly discard any irrelevant nsSNP 
and guide the research focus toward functionally 



significant mutations. This approach also has the advan- 
tage that not only conserved amino acid positions in a 
protein family can be labeled as functionally relevant, 
but also non-conserved amino acid positions can be 
identified having a functional effect. 

The MOSST algorithm has been implemented as a 
MATLAB GUI and used to analyze endoglucanases 
belonging to glycosyl hydrolases family 16, in order to 
determine interesting mutagenesis targets on them. The 
analysis has indicated amino acids that could be 
mutated in p-l,3-endoglucanase Bglll of C. cellulans to 
obtain critical changes in the enzymatic activity: putative 
substrate specificity amino acid determinants for liche- 
nase and laminarinase activity and determinants of pH- 
activity profiles for these enzyme family, thus confirming 
MOSST performance as a predictive tool for the study 
of functionally relevant mutations. 

The proposed methodology also has limitations, espe- 
cially in that it uses only information derived from a 
multiple alignment and the statistical result for each 
position will strongly depend on the quality of the align- 
ment and it does not prevent a priori this significance 
level to vary between one alignment and another. In this 
sense, the quality of the alignment is determinant and 
should be carefully controlled. Moreover, the algorithm 
does not give any suggestion about the real nature and 
contribution of relevant amino acids to the structure 
and function of the protein, but the predictions could 
be analyzed using other predictive techniques or tested 
by mutagenesis experiments. Another limitation is 
related to the fact that MOSST operates on single posi- 
tions in a multiple alignment and all statistic parameters 
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are calculated for each position without taking into con- 
sideration the simultaneous occurrence of mutations in 
other positions that can compensate for the effect of the 
first. This excludes the study of correlated compensatory 
mutations using MOSST in its current version. 

Although the developed procedure does not give any 
indication about the functional implications of the 
amino acids positions identified as relevant, an unre- 
lated analysis that use another type of information 
beyond the statistical inference performed on the multi- 
ple alignment by MOSST could shed light on their con- 
tribution to the protein function or structure. 
Specifically, classical methods to study experimentally 
and/or predictively the structure-function relationships 
in proteins can provide this kind of external informa- 
tion. For example, this type of analysis has been used to 
test the functional relevance of the amino acids identi- 
fied by MOSST in family 16 glycosyl hydrolases. In the 
case of position 209 of the alignment of these proteins, 
a mechanistic-electrostatic analysis yielded an explana- 
tion of the functional contribution of this amino acid 
position to catalysis [61]. The implementation and test- 
ing results set forth in this work show a promising per- 
formance of this technique as a valuable bioinformatics 
tool to aid in the development of new protein variants 
and to aid in the understanding of function-structure 
relationships in proteins. 

Methods 

Determination of optimal classification properties for 
MOSST 

55 properties measured by scales that are used to rate 
and sort amino acids were selected from the literature, 
including the most frequently used scales of hydropho- 
bicity or hydrophilicity and secondary structure confor- 
mational scales, as well as many others based on 
different chemical and physical properties of amino 
acids. 

To find the underlying organization in this varied 
group of characteristics and remove data redundancy, 
property scales were normalized and a clustering analy- 
sis was performed on them. The analysis indicated that 
it is not necessary to use a larger number of scales to 
obtain a more accurate classification of the natural 
amino acids. The clustering analyses performed classi- 
fied the 55 amino acid scales considered in 7 clusters of 
variables, each cluster sharing similar characteristics and 
tendencies and representing a defined physicochemical 
property. 

To obtain a set of independent (orthogonal) variables, 
a principal component analysis was performed over the 
seven vectors obtained from the cluster analysis. Three 
principal component vectors were extracted from the 
set of seven vectors that resulted from the cluster 



analysis, these three factors representing 94.1% of the 
variance within said vectors and normalized. A physical 
representation was assigned to each of these factors 
using correlation analyses with the original variables. An 
amino acid classification test was performed using these 
three orthogonal properties, which agrees with the prac- 
tical biochemical experience (data not shown). 

Variance and average distributions for each orthogonal 
property 

The cumulative distribution functions (CDFs) of the var- 
iances calculated for the population of random combi- 
nations of n amino acids were determined for each 
extracted property, calculating the average and standard 
deviation for all the possible amino acid combinations 
of n = 2 ... 30 using a recursive discretization algorithm. 
The distributions for combinations of higher numbers 
of amino acids was estimated as equal to the combina- 
tion of 30 amino acids, following the Law of Large 
Numbers. A similar approach was used to calculate the 
CDFs and PDFs for combinations of n amino acids. The 
probability distribution function of the differences of 
variances between any two components was calculated 
following the same methodological approach employed 
in the calculation of the PDFs for the variances of differ- 
ent amino acid combinations. 

Practical implementation of MOSST algorithm 

MOSST results were implemented in a graphical inter- 
face in order to allow the user to have an integral vision 
over them and to modify the parameters of the algo- 
rithm on-the-run to obtain the best possible results for 
each given analysis. In order to achieve this goal, the 
MOSST algorithm was implemented using MATLAB 
Graphical User Interface (GUI). MATLAB also provided 
the mathematical routines to allow a fast and robust 
mathematical treatment of the data and results. 

The use of this GUI allows manual determination of 
"primary positions", i.e. those "very significant" positions 
that have exceptionally high mutability scores as defined 
by the statistic theory behind the algorithm. By using 
the MOSST graphical interface, the user can select both 
manually or semi-automatically the threshold signifi- 
cance value above which a "very significant" position 
becomes a "primary position" and thus a primary muta- 
genesis objective. 

Additional File 1 annexed to this manuscript includes 
general instructions to install and use the MATLAB 
GUI implementation of MOSST, Additional File 2 con- 
tains all routines for the MATLAB GUI implementation 
of MOSST and exemplary alignment files for family 16 
glycosyl hydrolases, and Additional File 3 is a basic user 
guide to run and operate the MOSST MATLAB GUI 
implementation. 
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